Data Source: https://www.kaggle.com/datasets/rohanrao/formula-1-world-championship-1950-2020
This dataset contains data on Formula 1 races. There are variables describing the duration of a pit stop, that is, the time a car spends in the pit box changing its wheels; the driver of that car; the number of stops the car has already made in the race; and the lap on which the stop took place.
In this study we analyze the time a car loses when stopping to change its wheels; that is, our target is to predict the duration of a pit stop in milliseconds.
First of all we load the data and remove redundant columns.
rm(list = ls())
setwd("C:/Users/alejo/OneDrive - Universidad Carlos III de Madrid/UC3M/Año 3/Segundo Cuatri/Bayesian/project/data")
data <- read.csv("pit_stops.csv")
drivers <- read.csv("drivers.csv")
races <- read.csv("races.csv")
circuits = read.csv("circuits.csv")
data = data[, -6] #remove time in seconds.
data = data[, -5] #remove time in hours
library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(MCMCpack)
## Warning: package 'MCMCpack' was built under R version 4.2.3
## Loading required package: coda
## Warning: package 'coda' was built under R version 4.2.3
## Loading required package: MASS
##
## Attaching package: 'MASS'
## The following object is masked _by_ '.GlobalEnv':
##
## drivers
## ##
## ## Markov Chain Monte Carlo Package (MCMCpack)
## ## Copyright (C) 2003-2023 Andrew D. Martin, Kevin M. Quinn, and Jong Hee Park
## ##
## ## Support provided by the U.S. National Science Foundation
## ## (Grants SES-0350646 and SES-0350613)
## ##
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
summary(data$milliseconds)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12897 21908 23557 72402 26187 3069017
ggplot(data=data) + aes(y = milliseconds) + geom_boxplot(fill = "darkgreen") + ggtitle("Pit Stop Time (ms)")
Here we can see some pit stops of around 300 seconds. A normal pit stop
should not last longer than 30-40 seconds, so those values are quite
unusual. The boxplot confirms the presence of outliers, and we have
decided to remove the pit stops that last longer than 40 seconds, as
those are the ones where the car suffered some kind of accident and many
parts had to be replaced.
We will focus on analyzing only the stops in which the car just changes its wheels, as we have no data describing the condition in which the car arrives at the box.
data = data[!(data$milliseconds >= 40000),] #removing rows with pit_stops that last more than 40 secs.
ggplot(data=data) + aes(y = milliseconds) + geom_boxplot(fill = "darkgreen") + ggtitle("Pit Stop Time (ms)")
# We could lower the threshold even further; let's check:
ggplot(data=data) + aes(y = milliseconds) + ylim(0,35000)+ geom_boxplot(fill = "darkgreen") + ggtitle("Pit Stop Time (ms)")
## Warning: Removed 165 rows containing non-finite values (`stat_boxplot()`).
summary(data$milliseconds)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12897 21831 23376 24142 25471 39980
We have removed all pit stops longer than 40 seconds: those are the ones where some car-related problem occurred, which we cannot control for.
Since our target is to predict pit-stop time and we have around 200 different races, it is better to replace the race id with the circuit on which each race was run: races on the same circuit should have similar pit-stop times, because the shape of the track is the same. This also lets us use the circuits as categorical variables.
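The joins below rely on base R's merge(), which matches rows on a shared key column. A minimal sketch with hypothetical ids, just to illustrate how each race gets its circuitId attached:

```r
# Toy version of the join: attach each race's circuitId via the shared raceId key
pit <- data.frame(raceId = c(1, 1, 2), milliseconds = c(23000, 24000, 22000))
rc  <- data.frame(raceId = c(1, 2), circuitId = c(10, 20))
joined <- merge(pit, rc, by = "raceId")
joined$circuitId  # 10 10 20
```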
old_data = data
# Join the races with their circuitIds and merge the result with our data:
races = races[, c("raceId", "circuitId")]
data <- merge(data, races, by = "raceId")
data = data[, -1] #remove raceId, as it is no longer of interest to us.
# I am also interested in having the circuits as a categorical variable:
circuits = circuits[, c("circuitId", "circuitRef")]
data = merge(data, circuits, by = "circuitId")
# I am also interested in having the drivers as a categorical variable:
drivers = drivers[, c("driverId", "driverRef")]
data = merge(data, drivers, by = "driverId")
data$circuitRef = as.factor(data$circuitRef)
data$driverRef = as.factor(data$driverRef)
# Removing numerical variables as we have the categorical ones we are going to use:
data = data[,-1]
data = data[,-1]
#Now that the data is cleaner, let's see how each NUMERICAL variable is related to our target:
attach(data)
hist(stop,col="darkgreen"); hist(lap,col="darkgreen"); hist(milliseconds,col="darkgreen");hist(log(milliseconds),col="darkgreen");
summary(milliseconds)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12897 21831 23376 24142 25471 39980
sd(milliseconds)
## [1] 3890.75
Checking these plots we can see that the number of stops in a race is usually 1 or 2, and more stops are a rare event. Races tend to have around 40-60 laps, and the pit stops mostly happen between laps 10 and 40.
We are not going to use the variables stop and lap, as they are essentially unrelated to the time a car takes to stop; we will confirm this now with a correlation plot.
Before that, note that since we have removed many outliers, the variable we want to estimate, the milliseconds, still has long tails on both sides, which makes it harder to assign a distribution.
Now, we are checking for correlation between the variables:
ggcorr(data, label=T)
## Warning in ggcorr(data, label = T): data in column(s) 'circuitRef', 'driverRef'
## are not numeric and were ignored
The only correlation is between stop and lap. This is not interesting
for us: the further a race progresses, the more laps have been run and
the more stops a car will have made.
Now let us look at two very interesting plots of the categorical variables, which show how relevant they are:
top_circuits <- names(sort(table(data$circuitRef), decreasing = TRUE))[1:15]
data_filtered <- subset(data, circuitRef %in% top_circuits)
ggplot(data_filtered, aes(fill = circuitRef, y = milliseconds)) + geom_boxplot() + labs(title= "Pit-Stop time of Top 15 Circuits")
Depending on the circuit, the distribution of pit-stop time changes a lot. This is the variable we are going to use to predict pit-stop time, since depending on the shape of the circuit we can expect a different stop time.
# Filter the drivers to only include the top 15, those with the most recorded stops.
top_drivers <- names(sort(table(data$driverRef), decreasing = TRUE))[1:15]
data_filtered <- subset(data, driverRef %in% top_drivers)
ggplot(data_filtered, aes(fill = driverRef, y = milliseconds)) + geom_boxplot() + labs(title= "Pit-Stop time of Top 15 Drivers")
Regarding the drivers, the boxplots show that all of them follow more or less the same distribution of pit-stop times. They will not be significant for predicting the milliseconds, but we will check this with a density plot too.
p = ggplot(data)+aes(x=milliseconds, fill = circuitRef) + geom_density(alpha = 0.5) + ggtitle("Pit-Stops(ms) depending on the circuit"); ggplotly(p)
Here we confirm our hypothesis: each circuit has a different time distribution, as the track shape differs and with it the pit-stop time.
p = ggplot(data_filtered)+aes(x=milliseconds, fill = driverRef) + geom_density(alpha = 0.5) + ggtitle("Pit-Stops(ms) depending on the driver"); ggplotly(p)
By looking at this graph we can state that the driver is not useful for predicting the pit-stop time of a car on a circuit, as all the drivers follow the same distribution. We can also see that the density has two main peaks, which suggests the distribution is bimodal.
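A bimodal density of this kind can be sketched with a two-component normal mixture; the component means and weights below are hypothetical, chosen only to illustrate the shape, not fitted to the pit-stop data:

```r
set.seed(1)
n_sketch <- 5000
# Hypothetical mixture: two pit-stop "regimes" around 22 s and 26 s
component <- rbinom(n_sketch, 1, 0.6)
times <- ifelse(component == 1,
                rnorm(n_sketch, mean = 22000, sd = 1000),
                rnorm(n_sketch, mean = 26000, sd = 1500))
# A kernel density estimate of such a sample shows two clear peaks
d <- density(times)
peaks <- which(diff(sign(diff(d$y))) == -2)  # indices of local maxima
length(peaks)
```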
First of all, given all the observations for each circuit, we compute the mean time for each of them and check the results. Any result worse than this benchmark will be considered not useful at all.
results_benchmark <- aggregate(milliseconds ~ circuitRef, data = data, FUN = mean)
aux = data[, c("circuitRef", "milliseconds")]
results_benchmark <- merge(aux, results_benchmark, by = "circuitRef")
ggplot(results_benchmark, aes(x = milliseconds.x, y = ..density..)) +
geom_histogram(fill = "darkgreen", bins = 40) +
geom_density(aes(x = milliseconds.y), color = "black", size = 1) +
labs(x = "Milliseconds", y = "Density", title = "Benchmark averaging the value for each circuit") +
theme_bw()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.
This is the simplest result we can get: computing the mean of each
circuit and predicting every stop with that mean.
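This benchmark is in fact identical to fitting a Gaussian GLM with the circuit as the only covariate: with a single factor and the identity link, the fitted value for each level is exactly the group mean. A small sketch on simulated data with hypothetical circuit labels:

```r
set.seed(42)
toy <- data.frame(circuitRef   = factor(rep(c("A", "B", "C"), each = 50)),
                  milliseconds = c(rnorm(50, 22000, 1000),
                                   rnorm(50, 25000, 1000),
                                   rnorm(50, 30000, 1000)))
# Benchmark: per-circuit mean
group_means <- tapply(toy$milliseconds, toy$circuitRef, mean)
# Gaussian GLM with the factor as the only covariate
fit <- glm(milliseconds ~ circuitRef, data = toy, family = "gaussian")
preds <- predict(fit,
                 newdata = data.frame(circuitRef = factor(c("A", "B", "C"))),
                 type = "response")
all.equal(unname(preds), unname(as.vector(group_means)))  # TRUE
```

This is why any model that cannot beat the per-circuit mean adds no information beyond the circuit itself.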
We now compute the R^2 obtained when predicting the time using circuitRef, first assuming that the data follows a Gaussian distribution:
# Get unique circuit names from the dataset
circuit = unique(data$circuitRef)
# Get fit model
fit = glm(milliseconds~circuitRef, data=data, family="gaussian")
summary(fit)
##
## Call:
## glm(formula = milliseconds ~ circuitRef, family = "gaussian",
## data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -10049.7 -1195.8 -276.6 685.4 17673.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23202.0 164.4 141.154 < 2e-16 ***
## circuitRefamericas 1887.0 226.6 8.327 < 2e-16 ***
## circuitRefbahrain 1835.2 203.1 9.034 < 2e-16 ***
## circuitRefbaku -2263.4 257.7 -8.782 < 2e-16 ***
## circuitRefbuddh 440.3 309.9 1.421 0.155441
## circuitRefcatalunya -1012.0 202.4 -5.001 5.81e-07 ***
## circuitRefhockenheimring -2645.5 234.1 -11.303 < 2e-16 ***
## circuitRefhungaroring -830.3 202.9 -4.092 4.31e-05 ***
## circuitRefimola 7974.8 359.6 22.180 < 2e-16 ***
## circuitRefinterlagos -235.5 203.5 -1.157 0.247134
## circuitRefistanbul 126.8 293.2 0.432 0.665425
## circuitRefjeddah -793.3 536.3 -1.479 0.139126
## circuitReflosail 3216.1 544.3 5.908 3.58e-09 ***
## circuitRefmarina_bay 6840.2 209.2 32.691 < 2e-16 ***
## circuitRefmiami -3088.8 591.7 -5.220 1.83e-07 ***
## circuitRefmonaco 3152.8 227.6 13.853 < 2e-16 ***
## circuitRefmonza 1855.6 229.8 8.073 7.70e-16 ***
## circuitRefmugello -917.1 473.3 -1.937 0.052730 .
## circuitRefnurburgring -1388.0 293.2 -4.733 2.24e-06 ***
## circuitRefportimao 1984.3 349.4 5.680 1.39e-08 ***
## circuitRefred_bull_ring -1160.4 223.4 -5.194 2.10e-07 ***
## circuitRefricard 8125.7 347.8 23.364 < 2e-16 ***
## circuitRefrodriguez 226.1 263.3 0.859 0.390467
## circuitRefsepang 1966.7 221.3 8.887 < 2e-16 ***
## circuitRefshanghai -784.4 211.0 -3.718 0.000202 ***
## circuitRefsilverstone 5184.3 212.8 24.359 < 2e-16 ***
## circuitRefsochi 7534.6 264.1 28.527 < 2e-16 ***
## circuitRefspa 111.7 216.9 0.515 0.606804
## circuitRefsuzuka 519.3 217.6 2.386 0.017055 *
## circuitRefvalencia -1232.5 312.9 -3.940 8.22e-05 ***
## circuitRefvilleneuve 585.6 225.8 2.593 0.009519 **
## circuitRefyas_marina -1018.4 220.0 -4.629 3.72e-06 ***
## circuitRefyeongam -910.0 292.5 -3.111 0.001871 **
## circuitRefzandvoort -3703.5 328.3 -11.280 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 8078615)
##
## Null deviance: 1.3877e+11 on 9167 degrees of freedom
## Residual deviance: 7.3790e+10 on 9134 degrees of freedom
## AIC: 171868
##
## Number of Fisher Scoring iterations: 2
# Initialize vector to store predictions
frequentist_means <- rep(NA, length(circuit))
# Make a prediction for each circuit
for (i in 1:length(circuit)) {
frequentist_means[i] <- predict(fit, newdata = data.frame(circuitRef = circuit[i]), type = "response")
}
# Combine results into a data frame
results <- data.frame(circuitRef = circuit, frequentist_time = frequentist_means)
aux = data[, c("circuitRef", "milliseconds")]
results <- merge(aux, results, by = "circuitRef")
MSE = (1 / length(results) * sum(((results$milliseconds - results$frequentist_time)/mean(results$milliseconds))^2)); MSE
## [1] 42.20223
R_squared = cor(results$milliseconds, results$frequentist_time)^2; R_squared
## [1] 0.4682543
# R^2 is the squared correlation.
# It measures how much of the variability of the milliseconds the model explains.
In the model summary we can see that most of the betas are
significantly different from zero.
Here we have used the categorical value of the circuit, its name, and
the model explains about 46% of the variability of the pit stops, which
seems a fairly good result.
Notice that this is not exactly the MSE, as we are dividing the errors by the mean time for that circuit. We will not trust this "MSE" number on its own, but it is useful for comparison.
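For reference, the conventional (unnormalized) MSE divides the sum of squared errors by the number of observations; note that on a data frame length() returns the number of columns, not rows, so nrow() is the safer denominator. A toy example with made-up numbers:

```r
# Conventional MSE: mean of the squared errors over the observations
y_true <- c(21000, 23000, 25000, 27000)
y_pred <- c(22000, 22500, 25500, 26000)
mse  <- mean((y_true - y_pred)^2)  # sum of squares / number of observations
rmse <- sqrt(mse)                  # back on the original scale (milliseconds)
mse   # 625000
rmse  # ~790.57
```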
Just for comparison, we fit the glm model with the families "gaussian", "Gamma" and "inverse.gaussian" and compare the results in a table.
families = c("gaussian", "Gamma", "inverse.gaussian")
# Get unique circuit names from the dataset
circuit = unique(data$circuitRef)
Rs_squared = AICs = MSEs = Deviances = rep(NA, length(families))
# Get fit model
for (j in 1:length(families)) {
fit = glm(milliseconds~circuitRef, data=data, family=families[j])
# Initialize vector to store predictions
frequentist_means <- rep(NA, length(circuit))
# Make a prediction for each circuit
for (i in 1:length(circuit)) {
frequentist_means[i] <- predict(fit, newdata = data.frame(circuitRef = circuit[i]), type = "response")
}
# Combine results into a data frame
results <- data.frame(circuitRef = circuit, frequentist_time = frequentist_means)
aux = data[, c("circuitRef", "milliseconds")]
results <- merge(aux, results, by = "circuitRef")
MSEs[j] = (1 / length(results) * sum((results$milliseconds - results$frequentist_time)^2));
Rs_squared[j] = cor(results$milliseconds, results$frequentist_time)^2 # R^2 is the squared correlation;
# it measures how much of the variability of the milliseconds the model explains.
AICs[j] = fit$aic
Deviances[j] = fit$null.deviance
}
models <- data.frame(Family = families, R_Squared = Rs_squared, AIC =AICs, MSE = MSEs, Deviance = Deviances); models
## Family R_Squared AIC MSE Deviance
## 1 gaussian 0.4682543 171868.2 24596689450 1.387695e+11
## 2 Gamma 0.4682543 171273.6 24596689450 2.255710e+02
## 3 inverse.gaussian 0.4682543 171281.3 24596689450 9.309282e-03
A Gaussian model may be adequate, but the AIC for the Gamma model is a bit lower. This seems right, as positive, right-skewed times are generally well modelled by a Gamma distribution.
At the end of this document we will follow a Gamma distribution using a Bayesian approach.
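The intuition that Gamma errors suit time data can be checked on simulated data: when the response really is Gamma-distributed, the Gamma family tends to reach a lower AIC than the Gaussian one. A sketch with made-up parameters, not the pit-stop data:

```r
set.seed(7)
# Simulated right-skewed "times" with mean about 24000 ms
y <- rgamma(2000, shape = 40, rate = 40 / 24000)
g <- factor(sample(c("A", "B"), 2000, replace = TRUE))
fit_gauss <- glm(y ~ g, family = "gaussian")
fit_gamma <- glm(y ~ g, family = "Gamma")
# On truly Gamma data the Gamma family usually attains the lower AIC
AIC(fit_gauss)
AIC(fit_gamma)
```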
We now fit the same model with the driver as predictor. We do not expect any improvement in the results, but we check them anyway.
# Get unique driver IDs from the dataset
driver = unique(data$driverRef)
# Get fit model
fit = glm(milliseconds~driverRef, data=data, family="Gamma")
# Initialize vector to store predictions
frequentist_means <- rep(NA, length(driver))
# Fit GLM and make predictions for each driver
for (i in 1:length(driver)) {
frequentist_means[i] <- predict(fit, newdata = data.frame(driverRef = driver[i]), type = "response")
}
# Combine results into a data frame
results <- data.frame(driverRef = driver, frequentist_time = frequentist_means)
aux = data[, c("driverRef", "milliseconds")]
results <- merge(aux, results, by = "driverRef")
#MSE = (1/n) * sum((y_true - y_pred)^2)
#R-squared = 1 - (sum((y_true - y_pred)^2) / sum((y_true - mean(y_true))^2))
MSE = (1 / length(results) * sum(((results$milliseconds - results$frequentist_time)/mean(results$milliseconds))^2)); MSE
## [1] 76.89451
R_squared = cor(results$milliseconds, results$frequentist_time)^2; R_squared # It's the correlation squared.
## [1] 0.0311336
# Explains how well the model explains the variability of the milliseconds.
remove(aux)
We can see that here the R^2 is very low, as expected: the time of a pit stop depends on the circuit where it happens, not on the driver's skills. So we will not use the drivers to predict pit-stop time.
Just to prove it again, we look at the stop times predicted for a fixed driver:
# Get unique circuit names from the dataset
circuit = unique(data$circuitRef)
# Get fit model
fit = glm(milliseconds~driverRef+circuitRef, data=data, family="Gamma")
# Initialize vector to store predictions
frequentist_means <- rep(NA, length(circuit))
# Make a prediction for each circuit, keeping the driver fixed
for (i in 1:length(circuit)) {
frequentist_means[i] <- predict(fit, newdata = data.frame(driverRef = "alonso", circuitRef = circuit[i]), type = "response")
}
# Combine results into a data frame
results <- data.frame(circuitRef = circuit, frequentist_time = frequentist_means)
aux = data[, c("circuitRef", "milliseconds")]
results <- merge(aux, results, by = "circuitRef")
MSE = (1 / length(results) * sum(((results$milliseconds - results$frequentist_time)/mean(results$milliseconds))^2)); MSE
## [1] 44.74101
R_squared = cor(results$milliseconds, results$frequentist_time)^2; R_squared # It's the correlation squared.
## [1] 0.4674421
# Explains how well the model explains the variability of the milliseconds.
R^2 is more or less the same as before, and the results do not change.
ggplot(results, aes(x = milliseconds, y = ..density..)) +
geom_histogram(fill = "darkgreen", bins = 40) +
geom_density(aes(x = frequentist_time), color = "black", size = 1) +
labs(x = "Milliseconds", y = "Density", title = "Histogram with Frequentist Prediction for a specific driver") +
theme_bw()
We can see that the driver is not relevant for the stop time; the
results do not change with or without it. This 46% is the same as
before, and the histogram is also quite similar.
Up to now we have run quite a few trials without beating the benchmark, as the plots themselves show.
If we apply the log to the milliseconds, these are the results we get:
data$milliseconds = log(data$milliseconds)
# Get unique circuit names from the dataset
circuit = unique(data$circuitRef)
# Get fit model
fit = glm(milliseconds~circuitRef, data=data, family="Gamma")
# Initialize vector to store predictions
frequentist_means <- rep(NA, length(circuit))
# Make a prediction for each circuit
for (i in 1:length(circuit)) {
frequentist_means[i] <- predict(fit, newdata = data.frame(circuitRef = circuit[i]), type = "response")
}
# Combine results into a data frame
results <- data.frame(circuitRef = circuit, frequentist_time = frequentist_means)
aux = data[, c("circuitRef", "milliseconds")]
results <- merge(aux, results, by = "circuitRef")
results$milliseconds = exp(results$milliseconds)
results$frequentist_time = exp(results$frequentist_time)
MSE = (1 / length(results) * sum(((results$milliseconds - results$frequentist_time)/mean(results$milliseconds))^2)); MSE
## [1] 42.34431
R_squared = cor(results$milliseconds, results$frequentist_time)^2; R_squared # It's the correlation squared.
## [1] 0.4680891
# Explains how well the model explains the variability of the milliseconds.
data$milliseconds = exp(data$milliseconds)
ggplot(results, aes(x = milliseconds, y = ..density..)) +
geom_histogram(fill = "darkgreen", bins = 40) +
geom_density(aes(x = frequentist_time), color = "black", size = 1) +
labs(x = "Milliseconds", y = "Density", title = "Frequentist Prediction applying logarithms") +
theme_bw()
Essentially the same as before. We applied logarithms to the original
data, fitted the model, and took the exponential before plotting. Let's
move on to a Bayesian approach and hope for better results.
bayes.model <- MCMCregress(milliseconds ~ circuitRef, data = data, burnin = 1000, mcmc = 10000) # mcmc, not nitt, is the MCMCregress argument for post-burnin iterations
summary(bayes.model)
##
## Iterations = 1001:11000
## Thinning interval = 1
## Number of chains = 1
## Sample size per chain = 10000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## (Intercept) 23200.4 163.5 1.635 1.635
## circuitRefamericas 1889.7 226.2 2.262 2.308
## circuitRefbahrain 1836.0 202.8 2.028 2.060
## circuitRefbaku -2263.7 255.8 2.558 2.558
## circuitRefbuddh 442.6 309.9 3.099 3.099
## circuitRefcatalunya -1009.6 200.9 2.009 2.009
## circuitRefhockenheimring -2641.9 230.3 2.303 2.303
## circuitRefhungaroring -828.0 201.7 2.017 2.017
## circuitRefimola 7977.8 357.6 3.576 3.576
## circuitRefinterlagos -233.2 205.4 2.054 2.054
## circuitRefistanbul 127.2 292.9 2.929 2.929
## circuitRefjeddah -788.6 527.3 5.273 5.273
## circuitReflosail 3209.2 539.9 5.399 5.399
## circuitRefmarina_bay 6840.4 208.0 2.080 2.080
## circuitRefmiami -3087.8 588.1 5.881 6.021
## circuitRefmonaco 3154.9 226.5 2.265 2.265
## circuitRefmonza 1859.5 230.8 2.308 2.308
## circuitRefmugello -912.3 470.7 4.707 4.707
## circuitRefnurburgring -1386.2 289.1 2.891 2.941
## circuitRefportimao 1984.3 352.1 3.521 3.521
## circuitRefred_bull_ring -1160.4 224.7 2.247 2.247
## circuitRefricard 8129.4 348.8 3.488 3.488
## circuitRefrodriguez 228.4 261.6 2.616 2.616
## circuitRefsepang 1969.4 222.3 2.223 2.223
## circuitRefshanghai -783.4 211.6 2.116 2.116
## circuitRefsilverstone 5186.2 211.9 2.119 2.202
## circuitRefsochi 7534.6 265.7 2.657 2.705
## circuitRefspa 111.6 217.1 2.171 2.171
## circuitRefsuzuka 521.1 216.9 2.169 2.169
## circuitRefvalencia -1230.3 314.7 3.147 3.147
## circuitRefvilleneuve 588.4 224.9 2.249 2.249
## circuitRefyas_marina -1014.5 220.2 2.202 2.202
## circuitRefyeongam -907.7 294.1 2.941 2.941
## circuitRefzandvoort -3699.8 326.5 3.265 3.202
## sigma2 8081914.5 121694.8 1216.948 1216.948
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## (Intercept) 22874.45 23090.31 23201.7 23309.54 2.352e+04
## circuitRefamericas 1449.57 1737.43 1888.5 2039.93 2.337e+03
## circuitRefbahrain 1447.75 1700.15 1833.1 1970.27 2.236e+03
## circuitRefbaku -2756.00 -2436.69 -2265.9 -2093.89 -1.749e+03
## circuitRefbuddh -164.03 232.01 439.6 652.60 1.054e+03
## circuitRefcatalunya -1403.25 -1142.86 -1008.4 -873.74 -6.184e+02
## circuitRefhockenheimring -3090.48 -2798.32 -2644.4 -2485.84 -2.189e+03
## circuitRefhungaroring -1221.89 -964.62 -828.2 -691.74 -4.288e+02
## circuitRefimola 7271.92 7739.95 7975.2 8223.98 8.661e+03
## circuitRefinterlagos -631.54 -369.71 -235.2 -94.49 1.704e+02
## circuitRefistanbul -449.06 -72.78 128.1 321.69 7.050e+02
## circuitRefjeddah -1809.97 -1153.94 -791.6 -427.24 2.375e+02
## circuitReflosail 2152.33 2842.49 3210.9 3574.31 4.276e+03
## circuitRefmarina_bay 6437.62 6701.35 6839.9 6976.98 7.251e+03
## circuitRefmiami -4257.95 -3477.51 -3095.6 -2688.27 -1.938e+03
## circuitRefmonaco 2717.63 3002.03 3155.2 3306.17 3.602e+03
## circuitRefmonza 1409.00 1702.22 1859.3 2015.59 2.309e+03
## circuitRefmugello -1848.25 -1231.99 -907.1 -596.37 3.659e+00
## circuitRefnurburgring -1942.60 -1585.80 -1388.2 -1188.76 -8.224e+02
## circuitRefportimao 1301.10 1743.77 1983.1 2225.60 2.670e+03
## circuitRefred_bull_ring -1599.31 -1315.04 -1158.5 -1011.66 -7.124e+02
## circuitRefricard 7446.37 7891.71 8127.0 8366.60 8.815e+03
## circuitRefrodriguez -277.96 50.53 225.1 406.54 7.400e+02
## circuitRefsepang 1529.93 1820.73 1970.7 2118.40 2.410e+03
## circuitRefshanghai -1197.24 -923.93 -783.6 -643.50 -3.697e+02
## circuitRefsilverstone 4773.72 5042.91 5184.9 5330.04 5.603e+03
## circuitRefsochi 7028.03 7350.42 7533.5 7714.10 8.051e+03
## circuitRefspa -308.47 -34.98 112.0 258.55 5.283e+02
## circuitRefsuzuka 99.55 377.28 521.8 666.15 9.521e+02
## circuitRefvalencia -1838.43 -1442.70 -1234.4 -1020.44 -6.065e+02
## circuitRefvilleneuve 147.72 437.07 589.7 739.34 1.027e+03
## circuitRefyas_marina -1442.82 -1163.42 -1016.8 -866.05 -5.801e+02
## circuitRefyeongam -1489.53 -1105.32 -904.2 -707.47 -3.316e+02
## circuitRefzandvoort -4327.40 -3922.26 -3703.5 -3481.57 -3.062e+03
## sigma2 7850005.14 7998443.86 8080393.9 8164539.11 8.324e+06
#Checking for autocorrelation:
acf(bayes.model[,1])
acf(bayes.model[,2])
acf(bayes.model[,20])
As we can see, there is almost no autocorrelation, so the chains can be
used as they are.
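Besides eyeballing acf plots, the coda package (already loaded with MCMCpack) can summarize autocorrelation through the effective sample size; a sketch on two simulated chains:

```r
library(coda)
set.seed(3)
# An (almost) independent chain versus a strongly autocorrelated AR(1) chain
iid_chain <- mcmc(rnorm(10000))
ar_chain  <- mcmc(as.numeric(arima.sim(list(ar = 0.9), n = 10000)))
# With no autocorrelation the effective size stays near the number of draws;
# strong autocorrelation shrinks it dramatically
effectiveSize(iid_chain)
effectiveSize(ar_chain)
```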
We are going to compute the predictive distribution for the circuit "americas":
avg = bayes.model[,1] + bayes.model[,2] # 1 corresponds to albert-park, 2 to americas circuit and so on.
america_circuit = rnorm(length(bayes.model[,1]), mean = mean(avg), sd = sd(avg)) # Predictive Distribution
hist(america_circuit, col = "darkgreen", main = "Predictive distribution for America Circuit", xlab = "Time (ms)")
mean(america_circuit)
## [1] 25089.89
# This is a predictive interval:
quantile(america_circuit, c(0.025, 0.975)) # A 95% interval; it mainly reflects the posterior uncertainty of the circuit mean.
## 2.5% 97.5%
## 24787.26 25390.28
This is the predictive distribution obtained by drawing normal samples whose mean and standard deviation come from the MCMCregress posterior draws.
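Note that sd(avg) only captures the posterior uncertainty of the circuit mean; a full posterior predictive draw would also add the residual variance sigma2, which MCMCregress returns as the last column. A sketch with hypothetical posterior draws mimicking that layout:

```r
set.seed(11)
# Hypothetical posterior draws shaped like MCMCregress output:
# column 1 = intercept, column 2 = a circuit effect, column 3 = sigma2
draws <- cbind(rnorm(10000, 23200, 165),
               rnorm(10000, 1890, 226),
               rnorm(10000, 8.08e6, 1.2e5))
mu <- draws[, 1] + draws[, 2]
# One predictive draw per posterior draw: parameter uncertainty (mu)
# plus observation noise (sqrt(sigma2)) together
pred <- rnorm(nrow(draws), mean = mu, sd = sqrt(draws[, 3]))
quantile(pred, c(0.025, 0.975))  # much wider than an interval for the mean alone
```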
We now compute the predictive distribution for the circuit "imola":
avg = bayes.model[,1] + bayes.model[,9] # column 1 is the intercept (albert_park baseline); column 9 corresponds to imola, as columns follow alphabetical order.
circuit = rnorm(length(bayes.model[,1]), mean = mean(avg), sd = sd(avg)) # Predictive Distribution
# This is a predictive interval:
quantile(circuit, c(0.025, 0.975)) # A 95% interval; it mainly reflects the posterior uncertainty of the circuit mean.
## 2.5% 97.5%
## 30548.63 31811.03
mean(circuit)
## [1] 31177.83
hist(circuit, col = "darkgreen", main = "Predictive distribution for Imola Circuit", xlab = "Time (ms)")
We have chosen this circuit because the beta that corresponds to it is
quite large, meaning its pit-stop times clearly differ from the
baseline.
Now let us get predicted values using a normal posterior distribution for every observation.
bayes.model <- MCMCregress(milliseconds ~ circuitRef, data = data, burnin = 1000, mcmc = 12000, thin = 10) # again mcmc, not nitt, is the argument name
We introduce some thinning in case some of the predictors carry autocorrelation that we did not notice.
Now we compute the predicted milliseconds for each observation of the data, following the Bayesian model:
# Get unique circuit names, sorted to match the order of the model's columns
circuit = sort(unique(data$circuitRef))
# Initialize vector to store predictions
bayesian_means <- rep(NA, nrow(data))
# Draw a Bayesian prediction for each observation
for (i in 1:nrow(data)) {
index <- which(circuit == data$circuitRef[i])
avg = bayes.model[,index]
if(index>1){
avg = avg + bayes.model[,1] # add the intercept B0 to the circuit effect
}
bayesian_means[i] = rnorm(1, mean = mean(avg), sd = sd(avg)) # Predictive Distribution
}
# Combine results into a data frame
results_bayes <- data.frame(circuitRef = data$circuitRef, milliseconds = data$milliseconds,bayesian_time = bayesian_means)
#results$milliseconds = exp(results_bayes$milliseconds)
#MSE = (1/n) * sum((y_true - y_pred)^2)
#R-squared = 1 - (sum((y_true - y_pred)^2) / sum((y_true - mean(y_true))^2))
MSE = (1 / length(results_bayes) * sum(((results_bayes$milliseconds - results_bayes$bayesian_time)/mean(results_bayes$milliseconds))^2)); MSE
## [1] 42.47066
R_squared = cor(results_bayes$milliseconds, results_bayes$bayesian_time)^2; R_squared # It's the correlation squared.
## [1] 0.4655851
# Explains how well the model explains the variability of the milliseconds.
# R_squared = 1 - (sum((results_bayes$milliseconds - results_bayes$bayesian_time)^2) / sum((results_bayes$milliseconds - mean(results_bayes$bayesian_time))^2)); R_squared
hist(results_bayes$milliseconds, col = "darkgreen", main = "Real Distribution")
hist(results_bayes$bayesian_time, col = "darkgreen", main = "Predicted Distribution")
ggplot(results, aes(x = milliseconds, y = ..density..)) +
geom_histogram(fill = "darkgreen", bins = 40) +
geom_density(aes(x = frequentist_time), color = "black", size = 1) +
labs(x = "Milliseconds", y = "Density", title = "Histogram with Previous Frequentist Prediction") +
theme_bw()
The previous plot from the frequentist approach, which we can use to
compare models.
ggplot(results_bayes, aes(x = milliseconds, y = ..density..)) +
geom_histogram(fill = "darkgreen", bins = 40, boundary = 100) +
geom_density(aes(x = bayesian_time), color = "black", size = 1) +
labs(x = "Milliseconds", y = "Density", title = "Histogram with Bayesian Prediction") +
theme_bw()
Bayesian results. They are quite similar to the previous approach and to
the benchmark. We are using a normal distribution to predict data that
does not really follow one, so our results are not very accurate, as
expected.
The good thing about this last prediction is that the model captures well the two main peaks of the data that we saw in the original plots.
We are going to use Markov Chain simulation by defining a non-informative prior to compute the posterior and predictive distributions:
#Assume a non-informative prior, as we have no prior knowledge about a circuit:
nu.min=0; nu.max=100; a=0.01; b=0.01;
n=length(milliseconds)
# Set the number of burnin iterations:
burnin=1000; iters=40000
T=burnin+iters
# Initializing the vectors for nu and lambda:
nu=rep(0,T); lam=rep(0,T)
# initial values
nu[1]=10; lam[1]=5;
pac=0;
# Applying the Markov chain simulation (a Gibbs step for lambda and a Metropolis step for nu):
for (t in 2:T) {
lam[t]=rgamma(1,shape=a+n*nu[t-1],rate=b+sum(milliseconds))
nuc=rnorm(1,nu[t-1],sd=0.1)
if(nuc<nu.min || nuc>nu.max){nu[t]=nu[t-1]}
else{
logal=(nuc-1)*sum(log(milliseconds))-n*lgamma(nuc)+n*nuc*log(lam[t])
logal=logal-(nu[t-1]-1)*sum(log(milliseconds))+n*lgamma(nu[t-1])
logal=logal-n*nu[t-1]*log(lam[t])
u=runif(1)
if (log(u)<logal){
nu[t]=nuc; if (t>burnin){pac=pac+1}
}
else nu[t]=nu[t-1]
}
}
nu.post=nu[seq(burnin+1,iters,by=40)]
lam.post=lam[seq(burnin+1,iters,by=40)]
acf(nu.post)
acf(lam.post)
As we still have a lot of autocorrelation, let's try to reduce it; now
we keep only one out of every 100 draws:
nu.post=nu[seq(burnin+1,iters,by=100)]
lam.post=lam[seq(burnin+1,iters,by=100)]
ts.plot(nu.post)
ts.plot(lam.post)
acf(nu.post)
acf(lam.post)
# Proportion of accepted values
pac=pac/iters
pac
## [1] 0.586575
We have greatly reduced the autocorrelation by keeping only 1 of every 100 draws, which is good for the model.
Let's see the posterior distributions:
hist(nu.post, col = "darkgreen")
hist(lam.post, col = "darkgreen")
We can obtain 95% intervals for nu and lambda:
quantile(nu.post,c(0.025,0.975))
## 2.5% 97.5%
## 39.45077 42.03904
quantile(lam.post,c(0.025,0.975))
## 2.5% 97.5%
## 0.001633564 0.001741510
mean(nu.post); mean(lam.post)
## [1] 40.74967
## [1] 0.001688082
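As a sanity check of the sampler's Gibbs step, the full conditional used above, lambda | nu, x ~ Gamma(a + n*nu, b + sum(x)), should recover the true rate when applied to data simulated from a known Gamma with nu held at its true value. This is a reduced check of one conditional, with made-up true parameters, not the full sampler:

```r
set.seed(5)
a0 <- 0.01; b0 <- 0.01
nu_true  <- 40
lam_true <- 40 / 24000            # gives a mean of 24000 ms, on our data's scale
x_sim <- rgamma(5000, shape = nu_true, rate = lam_true)
n_sim <- length(x_sim)
# Draws from the full conditional of lambda with nu fixed at the truth
lam_draws <- rgamma(10000, shape = a0 + n_sim * nu_true, rate = b0 + sum(x_sim))
mean(lam_draws)                   # should land close to lam_true
```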
#Prediction probabilities:
M=100000
time.pred=rgamma(M,shape=nu.post,rate=lam.post)
hist(time.pred, col = "darkgreen", main = "Predicted time following a Gamma", xlab = "time (ms)")
Let's compute the probability of having a pit stop that lasts longer than 25 seconds:
mean(time.pred>25000)
## [1] 0.38805
# Interval for the predicted time of a pit stop:
quantile(time.pred,c(0.025,0.975))
## 2.5% 97.5%
## 17293.14 32108.35
We can see that the interval is quite wide: it goes from about 17 to 32 seconds.
hist(milliseconds,freq=F, col = "darkgreen", main = "Bayesian Prediction following a Gamma", xlab = "time (ms)" )
lines(density(time.pred))
To finish, we show in green the original distribution of the
milliseconds and, in black, the fitted density following a Gamma
distribution. These results seem better to us than the ones from the
Gaussian models seen before, as the line fits the data well.
The limitation of our data is that we are only given the circuit on which the race is run and the time each car takes to stop and change its wheels. We have little information about the car's status or any other condition that would help determine how long the car will be stopped.
Sometimes this happens with data and we have to work with it. We predict the time of a pit stop from the circuit, not from the conditions of the car or the race, so the results we get are quite similar to those we could get by hand by averaging all the observations we hold for each circuit.